Automated Statistical Modeling for Data Mining
نویسنده
چکیده
We seek to bridge the gap between basic statistical data mining tools and advanced statistical analysis software that requires an expert operator. In this paper, we explore the automation of the process of statistical data analysis via model scoring functions and search algorithms through the space of statistical models. In particular, we focus on automated modeling using generalized linear statistical models and especially models for categorical data analysis. By automating the process of selecting, building and solving statistical models, a computer can compare hundreds or thousands of possible models for a data set and produce a highly accurate statistical predictor with essentially no intermediate input from the operator. One application of this process is in expanding the statistical components of data mining packages. 1.0. Introduction. Data sets that do not regress well to a linear function of the predictor variables may be better fit by regressing to a polynomial whose terms are products of the predictor variables. In such multi-level models, where the response variable is fit to the sum of products of the predictor variables, we are able to better model the interactions between variables and arrive at a more accurate predictive model. This comes at the price of losing model generality (measured in degrees of freedom) and adding work to the computation of the maximum likelihood estimators so that, instead of least squared methods, we must use general multivariate optimization methods, such as Newton-Raphson. Our efforts focus on the family of generalized linear models (GLZs), which generalize the family of general linear models (GLMs), which, in turn, generalize linear models. Roughly speaking, the main idea behind GLZs is that there is a random response variable Y and a smooth, differentiable link function E such that E(Y) regresses to a polynomial function, g, of the predictor variables. We will focus our attention on cases where g is a multi-level hierarchical function, where summands involve an arbitrary product of predictor variables (i.e., terms contain products of predictor variables with exponents of zero or one). The term hierarchical refers to the requirement that a model containing a higher-level interaction term must also include all corresponding lower-level interactions. For example, the existence of the 3level interaction term XYZ in a model requires the existence of the terms X, Y, Z, XY, XZ, and YZ. The family of generalized linear models encompasses a great many of the data sets in which we at Wagner Associates were interested. In particular, consider the case where the predictor variables are either naturally categorical (e.g., species, gender, professional occupation, nationality, etc.) or may be binned into categories (e.g., age or weight brackets). If the categorical response variable is multinomial, taking one of several discrete values, then we may fit our sample data using logistic (a.k.a. logit) regression or using probit regression (perhaps the less popular model choice). If the categorical response variable represents an integer count, where the response is assumed to take values drawn from a Poisson distribution, then we may fit the sample data using a loglinear model. Because a great number of predictor variables are naturally categorical, categorical models have a wide range of applications, such as in the areas of credit scoring, marketing, behavioral studies, and epidemiology (see, for example, Agresti (2002) or Hosmer & Lemeshow). David Stephenson is with Daniel H. Wagner Associates, in Malvern, Pa. Email: [email protected]. This work was supported under the Office of Naval Research contract N0001499C0424. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Office of Naval Research. The process described in this paper is patent pending. 2004 David Stephenson. Unauthorized reproduction or distribution is prohibited.
منابع مشابه
A Data Mining approach for forecasting failure root causes: A case study in an Automated Teller Machine (ATM) manufacturing company
Based on the findings of Massachusetts Institute of Technology, organizations’ data double every five years. However, the rate of using data is 0.3. Nowadays, data mining tools have greatly facilitated the process of knowledge extraction from a welter of data. This paper presents a hybrid model using data gathered from an ATM manufacturing company. The steps of the research are based on CRISP-D...
متن کاملModellus: Automated Modeling of Complex Data Center Applications
The rising complexity of distributed server applications in enterprise data centers has made the tasks of modeling and analyzing their behavior increasingly difficult. This paper presents Modellus, a novel system for automated modeling of complex data center applications using statistical methods from data mining and machine learning. Modellus can automatically derive models to predict the reso...
متن کاملAutomated detection of coronavirus disease (COVID-19) by using data-mining techniques: a brief report
Background: The clinical field has vast sick data that has not been analyzed. Discovering a way to analyze this raw data and turn it into an information treasure can save many lives. Using data mining methods is an efficient way to analyze this large amount of raw data. It can predict the future with accurate knowledge of the past, providing new insights into disease diagnosis and prevention. S...
متن کاملConcept drift detection in business process logs using deep learning
Process mining provides a bridge between process modeling and analysis on the one hand and data mining on the other hand. Process mining aims at discovering, monitoring, and improving real processes by extracting knowledge from event logs. However, as most business processes change over time (e.g. the effects of new legislation, seasonal effects and etc.), traditional process mining techniques ...
متن کاملUsing stream sediment data to determine geochemical anomalies by statistical analysis and fractal modeling in Tafrash Region, Central Iran
Iranian Cenozoic magmatic belt, known as Urumieh-Dokhtar, is recognized as an important polymetallic mineralization which hosts porphyry, epithermal, and polymetallic skarn deposits. In this regard, multivariate analyses are generally used to extract significant anomalous geochemical signature of the mineral deposits. In this study, stepwise factor analysis, cluster analysis, and concentration–...
متن کاملAsk and Ye Shall Receive? Automated Text Mining of Michigan Capital Facility Finance Bond Election Proposals to Identify which Topics are Associated with Bond Passage and Voter Turnout
The purpose of this study is to bring together recent innovations in the research literature around school district capital facility finance, municipal bond elections, statistical models of conditional time-varying outcomes, and data mining algorithms for automated text mining of election ballot proposals to examine the factors that influence the probability of school districts in the state of ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2004